METHOD FOR TRACKING MOTION OF A FACE

FIELD OF THE INVENTION 

The present invention is related to the field of digital video processing and 
analysis, and more specifically, to a technique for tracking the three-dimensional (3-D) 
motion of a person's face from a sequence of two-dimensional (2-D) images of the 
person's face that are sequentially received in chronological order. 

BACKGROUND OF THE INVENTION 

Tracking the 3-D motion of a face in a sequence of 2-D images of the face is an 
important problem with applications to facial animation, hands-free human-computer 
interaction environment, and lip-reading. Tracking the motion of the face involves 
tracking the 2-D positions of salient features on the face. The salient features could be in 
the form of (i) points, such as the corners of the mouth, the eye pupils, or external 
markers placed on the face; (ii) lines, such as the hair-line, the boundary of the lips, and 
the boundary of eyebrows; and (iii) regions, such as the eyes, the nose, and the mouth. 

The salient features can also be synthetically created by placing markers on the 
face. Tracking of salient features is generally accomplished by detecting and matching a 
plurality of salient features of the face in a sequence of 2-D images of the face. The 
problem of detecting and matching the salient features is made difficult by variations in 
illumination, occlusion of the features, poor video quality, and the real-time constraint on 
the computer processing of the 2-D images. 

SUMMARY OF THE INVENTION 

The present invention provides an improvement designed to satisfy the 
aforementioned needs. Particularly, the present invention is directed to a computer
program product for tracking the motion of a person's face from a chronologically 
ordered sequence of images of the person's face for the purpose of animating a 3-D 
model of the same or another person's face, by performing the steps of: (a) receiving a 
sequence of 2-D images of a person's face; (b) tracking the salient features of the
person's face in the 2-D images; and (c) obtaining the 3-D global and local motion of the 
face from the tracked 2-D location of the salient features. 

BRIEF DESCRIPTION OF THE DRAWINGS

In the course of the following detailed description, reference will be made to the
attached drawings in which:

FIG. 1 is a perspective view of a computer system for implementing the present
invention;
FIG. 2 is a first flowchart for the method of the present invention;
FIG. 3 is a second flowchart for the method of the present invention;
FIG. 4 is a diagram illustrating the method of placing markers on a person's face;
FIG. 5 is a diagram further illustrating the method of placing markers on a
person's face;
FIG. 6a is a diagram illustrating the method of calculating the calibration
parameter of the camera with a target object;
FIG. 6b is a diagram illustrating the image of the target object captured by the
camera;
FIG. 7 is a diagram illustrating the method of acquiring a plurality of neutral
images of a person's face using the camera;
FIG. 8 is a diagram further illustrating the method of acquiring a plurality of
action images of a person's face using the camera;
FIG. 9 is a first table illustrating the method of locating global and local markers
on the person's face;
FIG. 10 is a second table illustrating the method of locating global and local
markers on the person's face;
FIG. 11 is a table illustrating the method of determining the surface normals of the
global markers;
FIG. 12 is a table illustrating the method of determining the surface normals and
the motion planes of the local markers;

DETAILED DESCRIPTION OF THE INVENTION



Referring to FIG. 1, there is illustrated a computer system 10 for implementing
the present invention. The computer system 10 includes a microprocessor-based unit 12
for receiving and processing software programs and for performing other well known
processing functions. The software programs are contained on a computer useable
medium 14, typically a compact disk, and are input into the microprocessor-based unit 12
via the compact disk player 16 electronically connected to the microprocessor-based unit
12. As an alternative to using the compact disk 14, programs could also be contained in an
Internet server 18 and input into the microprocessor-based unit 12 via an Internet
connection 20. A camera 22 is electronically connected to the microprocessor-based unit
12 to capture the 2-D images of a person's face. A display 24 is electronically connected
to the microprocessor-based unit 12 for displaying the images and user related
information associated with the software. A keyboard 26 is connected to the
microprocessor-based unit 12 for allowing a user to input information to the software. A
mouse 28 is also connected to the microprocessor-based unit 12 for selecting items on the
display 24 or for entering 2-D position information to the software, as is well known in
the art. As an alternative to using the mouse 28, a digital pen 30 and a digital pad 32 may be
used for selecting items on the display 24 and entering position information to the
software. The output of the computer system is either stored on a hard disk 34 connected
to the microprocessor-based unit 12, or uploaded to the Internet server 18 via the Internet
connection 20. Alternatively, the output of the computer system can be stored on another
computer useable medium 14, typically a compact disk, via a compact disk writer 36. The
below-described steps of the present invention are implemented on the computer system
10.

Referring to FIGS. 2 and 3, there are illustrated the ten steps of the present
invention which are first succinctly outlined and later described in detail. The first five
steps are the initialization steps of the invention. Briefly stated, the first five steps are as
follows: (a) selecting or placing salient features on the person's face (Step 100); (b)
calculating the calibration parameter of the camera (Step 110); (c) acquiring a plurality of
images of the person's face using the camera (Step 120); (d) calculating the 3-D positions
of the salient features (Step 130); and (e) determining the surface normals and motion
planes for the salient features (Step 140). The second five steps are the tracking steps of
the invention. Briefly stated, the second five steps are as follows: (f) acquiring a
chronologically ordered sequence of 2-D images of the person's face in action (Step 150);
(g) locking onto the salient features (Step 160); (h) tracking the global and local motion
of the face (Step 170); (i) determining tracking failure (Step 180); and (j) storing or
transmitting the global and local motion values (Step 190).

A. Selecting or Placing Features On The Person's Face For Motion Tracking 
(Step 100) 

Referring to FIGS. 3 and 4, in the first step 100, salient features are selected or
placed on the person's face for tracking the global and local motion of the face. Salient
features that can be selected for tracking the global motion are the hairline, the corners of
the eyes, the nostrils, and contours of the ears. Salient features that can be selected for
tracking the local motion are the eyebrows, eyelids, pupils, and the lips. Methods have
been proposed in the prior art for using the aforementioned salient features to track the
global and local motion of the face. In a preferred embodiment of the present invention,
salient features are designed and placed on the face rather than selected from what is
naturally available on the face. It is important to note that placing salient features on the
face allows for faster and more reliable motion tracking under adverse conditions for
tracking, such as variations in illumination, poor video quality, and partial occlusion of
the features.

Referring to FIG. 4, in a first preferred embodiment of the invention, circular
markers are placed on a head-set that is worn by the person. The head-set may comprise a
strap 206 for the skull, a strap 207 for the chin, and a strap 208 for the eyebrows. To
achieve rotation invariance, two concentric circles are used to create the markers; one
having twice the diameter of the other one, and the small one placed on top of the larger
one. To achieve the highest contrast, the circles are painted in black and white. Thus, in
the preferred embodiment, two types of markers are used: black-on-white 213 and
white-on-black 214 markers. Those skilled in the art understand that other markers may be used,
including but not limited to fluorescent dyes and contrasting paints.

Referring to FIG. 5, in a second preferred embodiment of the invention, circular
markers are placed directly on the person's face. Markers are placed on the following ten
locations on the person's face for tracking the global motion of the face, henceforth they
are referred to as the global markers: right-ear-base 251, left-ear-base 252, right-temple
253, left-temple 254, right-outer-forehead 255, left-outer-forehead 256,
right-central-forehead 257, left-central-forehead 258, nose-base 259, and nose-tip 260. Markers are
placed on the following six locations on the person's face for tracking the local motion of
the face, henceforth they are referred to as the local markers: right-lip-corner 261,
left-lip-corner 262, upper-lip-center 263, lower-lip-center 264, right-central-eyebrow 265, and
left-central-eyebrow 266.

B. Calculating The Calibration Parameter Of The Camera (Step 110) 

Referring to FIGS. 6a and 6b, in the second step 110, a perspective image of a
target object is captured with the camera with the target object being placed at
approximately the same distance from the camera as the person's face. The method of the
present invention uses the perspective image of the target object to calculate a camera
parameter that is used in the subsequent steps, hereinafter referred to as the E parameter.
It is instructive to note at this point that the E parameter has a non-negative value and it is
a measure of the amount of perspective deformation caused by the camera. A zero value
indicates no perspective deformation and the larger the value of the E parameter the more
the perspective deformation caused by the camera.

Still referring to FIGS. 6a and 6b, in a preferred embodiment of the invention, a
square-shaped object 211 is employed as the target object and the value of the E
parameter of the camera is calculated as follows: First, the four corners of the
quadrilateral 212 are either automatically detected or manually marked by a user on the
image 213 of the object captured by the camera. Let $(x_n, y_n)$, $n = 1,2,3,4$, represent
the 2-D coordinates of the four corners of the object expressed in units of pixels with
respect to the center 214 of the image 213. Letting $(X_n, Y_n, Z_n)$, $n = 1,2,3,4$, represent
the corresponding 3-D coordinates of the corners of the object in units of meters with
respect to the location 215 of the camera, the relationship between the 2-D and 3-D
coordinates is mathematically expressed as follows:

$$x_n = \frac{X_n}{Z_n} F D, \qquad y_n = \frac{Y_n}{Z_n} F D,$$

where $F$ denotes the focal length of the camera in meters, and $D$ denotes the meter to pixel
conversion factor. For the purpose of the present invention, it is necessary to find only the
value of the product $FD$. In the present invention, we refer to the inverse of this product
as the E parameter, hence in mathematical terms

$$E = \frac{1}{FD}.$$

We also take advantage of the fact that the target object is square shaped and planar,
hence letting $aI$ denote the 3-D vector from $(X_1,Y_1,Z_1)$ to $(X_2,Y_2,Z_2)$ and $aJ$ denote
the 3-D vector from $(X_1,Y_1,Z_1)$ to $(X_4,Y_4,Z_4)$, where $I$ and $J$ are orthonormal vectors
and $a$ is the size of the square, we have the following mathematical expressions for the
3-D positions of the corners of the square object:

$$(X_2,Y_2,Z_2) = (X_1,Y_1,Z_1) + aI,$$
$$(X_3,Y_3,Z_3) = (X_1,Y_1,Z_1) + aI + aJ,$$
$$(X_4,Y_4,Z_4) = (X_1,Y_1,Z_1) + aJ.$$

It is well known to anyone having knowledge in the field of 3-D geometry that the pair of
3-D orthonormal vectors $(I, J)$ is specified uniquely with 3 real numbers. Thus, on the
right hand side of the above equation set there is a total of 7 unknown real numbers
defining the square object: 3 in $(I, J)$, 3 in $(X_1,Y_1,Z_1)$, and the size of the square $a$.
Hence, including the E parameter, the following set of equations

$$x_n = \frac{1}{E}\frac{X_n}{Z_n}, \qquad y_n = \frac{1}{E}\frac{Y_n}{Z_n}$$

has a total of 8 unknown real numbers on the right hand side, and 8 measured quantities,
namely $(x_n, y_n)$, $n = 1,2,3,4$, on the left hand side, resulting in a unique solution for the
unknown real numbers in terms of the measured quantities. It is well known to anyone
knowledgeable in the field of algebra how to obtain the value of the E parameter from the
above equation set given only the measured quantities $(x_n, y_n)$, $n = 1,2,3,4$.
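
The forward model behind Step 110 can be illustrated with a minimal Python sketch. It only evaluates the projection equations for a square target; it does not solve for the E parameter. The function names and the numeric values (a product FD of 1000 pixels, a 0.1 m square at 1 m distance) are assumptions chosen for illustration and do not come from the description above.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def square_corners(p1: Vec3, i_vec: Vec3, j_vec: Vec3, a: float) -> List[Vec3]:
    """Corners 1..4 of the square: p1, p1 + a*I, p1 + a*I + a*J, p1 + a*J."""
    def shifted(si: float, sj: float) -> Vec3:
        return tuple(p + a * (si * u + sj * v)
                     for p, u, v in zip(p1, i_vec, j_vec))
    return [shifted(0, 0), shifted(1, 0), shifted(1, 1), shifted(0, 1)]


def project(point: Vec3, e_param: float) -> Tuple[float, float]:
    """Perspective projection x = X / (E * Z), y = Y / (E * Z), in pixels."""
    x3, y3, z3 = point
    return (x3 / (e_param * z3), y3 / (e_param * z3))


# Hypothetical numbers: FD = 1000 pixels, so E = 1/1000; a 0.1 m square
# facing the camera, with its first corner on the optical axis at Z = 1 m.
E = 1.0 / 1000.0
corners = square_corners((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.1)
pixels = [project(c, E) for c in corners]
```

With these assumed values the corner diagonally opposite the first one projects to roughly (100, 100) pixels. Tilting the square (changing the assumed `i_vec` and `j_vec`) makes the projected quadrilateral non-square, which is the perspective deformation that the E parameter measures.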

C. Acquiring A Plurality Of Images Of A Person's Face Using The Camera
(Step 120)

Referring to FIG. 2, the method of acquiring a plurality of images of a person's
face using the camera comprises the steps of (1) acquiring neutral images of the face
(Step 121); and (2) acquiring action images of the face (Step 122). In the following, a
detailed description of these steps is given.

C1. Acquiring Neutral Images Of The Face (Step 121)

Referring to FIGS. 2 and 7, in the third step 120, a plurality of 2-D images of the
person's face in the same neutral state are captured with the camera from different
directions. The neutral state for the face means that all face muscles are relaxed, eyes are
normally open, mouth is closed and lips are in contact. These images are subsequently
used to obtain the neutral 3-D positions of the salient features of the face, hence,
hereinafter they are referred to as the neutral images. The camera directions to capture
neutral images are selected so that the majority of salient features are visible in all
images. The face is not required to be at the same distance from the camera in all the
neutral images.

Still referring to FIG. 7, in a preferred embodiment of the present invention,
markers are placed on the person's face as described in Step 100, and fifteen camera
directions are selected for obtaining the neutral images. In order to obtain the neutral images,
the camera remains fixed and the person rotates his/her head to realize the following
fifteen different directions: front 221, forehead 222, chin 223, angled-right 224,
angled-right-tilted-down 225, angled-right-tilted-up 226, angled-left 227, angled-left-tilted-down
228, angled-left-tilted-up 229, full-right-profile 230, full-right-profile-tilted-down 231,
full-right-profile-tilted-up 232, full-left-profile 233, full-left-profile-tilted-down 234, and
full-left-profile-tilted-up 235.

C2. Acquiring Action Images Of The Face (Step 122) 

Referring to FIGS. 2 and 8, in the third step 120, a plurality of 2-D images of the
person's face in action states are captured with the camera from different directions. The
action states for the face include faces with a smiling mouth, a yawning mouth, raised
eyebrows, etc. These images are subsequently used to obtain the 3-D position of the local
salient features when the face is in action states, hence, hereinafter they are referred to as
the action images. The camera directions to capture the action images are selected so that
the majority of salient features are visible in all images. The face is not required to be at
the same distance from the camera in all the action images.

Still referring to FIG. 8, in a preferred embodiment of the present invention,
markers are placed on the person's face as described in Step 100 and five facial action
states and two camera directions for each action are selected. The facial action states are
as follows: smiling mouth, yawning mouth, kissing mouth, raised eyebrows, and
squeezed eyebrows. The camera directions are front and right. In order to obtain the
action images, the camera remains fixed and the person rotates his/her head while his/her
face assumes an action state to capture the following ten different images:
front-yawning-mouth 241, right-angled-yawning-mouth 242, front-smiling-mouth 243,
right-angled-smiling-mouth 244, front-kissing-mouth 245, right-angled-kissing-mouth 246,
front-raised-eyebrows 247, right-angled-raised-eyebrows 248, front-squeezed-eyebrows 249, and
right-angled-squeezed-eyebrows 250.

D. Calculating The Neutral 3-D Positions Of The Salient Features (Step 130)

Referring to FIG. 2, the method of calculating the neutral 3-D positions of the salient
features comprises the steps of (1) locating the global and local salient features in the
neutral and action images (Step 131); (2) calculating the 3-D positions of the global and
local salient features for the neutral face (Step 132); and (3) calculating the 3-D positions
of the local salient features for the action faces (Step 133). In the following, a detailed
description of these steps is given.

D1. Locating The Global And Local Salient Features In The Neutral And Action Images
(Step 131)

The salient features are automatically or manually located on the acquired images.
It is important to note that not all of the salient features may be visible in all neutral and
action images and some salient features may not be in their neutral position in some
action images. Thus, in the present invention, only the salient features that are visible
and in their neutral position are automatically or manually located in each neutral and
action image.

In a preferred embodiment of the invention, markers that are placed on the face
are used as the salient features as described in Step 100. These markers are manually
located in the neutral images that are indicated with an X in the table in FIG. 9, and are
manually located in action images that are indicated with an X in FIG. 10. The markers
are assumed to be invisible in those neutral images that are not indicated with an X in the
table in FIG. 9. The markers are not in their neutral position in those action images that
are not indicated with an X in the table in FIG. 10. In operation, the computer program
prompts the user to manually locate only the visible markers and markers that are in their
neutral position in each image.

D2. Calculating The 3-D Positions Of The Global And Local Salient Features For The
Neutral Face (Step 132)

Given the 2-D locations of the salient features in the neutral images where they
are visible, and the value of the E parameter of the camera obtained in Step 110, the 3-D
positions of the salient features of the person's face are calculated using a modified
version of the method in "Shape and Motion from Image Streams under Orthography: A
Factorization Method" by Carlo Tomasi and Takeo Kanade, International Journal of
Computer Vision, vol. 9, no. 2, pp. 137-154, 1992. In a preferred embodiment of the
present invention, global and local markers placed on the person's face as described in
Step 100 are used as the salient features. In the following, first, a general mathematical
analysis of 2-D image projections of 3-D marker positions is given. Next, the method of
"Shape and Motion from Image Streams under Orthography" is reviewed. Then, the
proposed modification to the method of "Factorization of Shape and Motion" is
presented.

Without loss of generality, assume that the coordinate axes of the camera system
are the unit vectors $i = (1,0,0)$, $j = (0,1,0)$, and $k = (0,0,1)$. Hence, the image plane
passes through $(0,0,-F)$ and is perpendicular to $k$. Let $N$ denote the number of global
markers and $P_n$, $n = 1,\ldots,N$, denote the coordinates of the global markers with respect to
the origin $(0,0,0)$ of the camera system. Likewise, let $M$ denote the number of local
markers and $Q_n$, $n = 1,\ldots,M$, denote the coordinates of the local markers with respect to
the origin $(0,0,0)$ of the camera system. Clearly, as the person's face is moved, the
coordinates of all the markers are changed. It is therefore more appropriate to use a local
coordinate system for the face to represent the coordinates of the markers. Let the unit
vectors $\tilde{i}$, $\tilde{j}$, and $\tilde{k}$ denote the coordinate axes for an arbitrary local coordinate system
for the face. The origin $C_0$ of the local coordinate system is defined to be the centroid of
the markers and is given by

$$C_0 = \frac{1}{N+M}\left(\sum_{n=1}^{N} P_n + \sum_{n=1}^{M} Q_n\right).$$

Furthermore, let $A_n$, $n = 1,\ldots,N$, denote the coordinates of the global markers and let
$B_n$, $n = 1,\ldots,M$, denote the coordinates of the local markers with respect to the origin of
the local coordinate system. Thus, as the person's face is moved, the origin of the local
coordinate system is changed but the local coordinates of the markers always remain
fixed.
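
As a small illustration of the centroid definition, the following Python sketch computes the origin of the local coordinate system as the mean of all global and local marker positions. The function name and the three marker positions are hypothetical values used only for the example.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def centroid(global_markers: Sequence[Vec3], local_markers: Sequence[Vec3]) -> Vec3:
    """Mean of all N global and M local marker positions (camera coordinates)."""
    points = list(global_markers) + list(local_markers)
    count = len(points)
    return tuple(sum(p[k] for p in points) / count for k in range(3))


# Hypothetical positions in meters: two global markers and one local marker.
c0 = centroid([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [(1.0, 3.0, 0.0)])
offsets = [tuple(p[k] - c0[k] for k in range(3))
           for p in [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]]
```

The offsets of the markers from the centroid sum to zero along every axis, which is what makes the centroid a convenient origin for the face-fixed coordinates.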

In order to relate the global coordinates $P_n$, $n = 1,\ldots,N$, and $Q_n$, $n = 1,\ldots,M$, to
the local coordinates $A_n$, $n = 1,\ldots,N$, and $B_n$, $n = 1,\ldots,M$, define the unit vectors
$I = (\tilde{i}_x, \tilde{j}_x, \tilde{k}_x)$, $J = (\tilde{i}_y, \tilde{j}_y, \tilde{k}_y)$, and $K = (\tilde{i}_z, \tilde{j}_z, \tilde{k}_z)$, where the subscripts $x$, $y$, and $z$
denote the coordinates of the respective vectors along the axes $i$, $j$, and $k$ of the global
coordinate system. Then, the relationship between the global coordinates and the local
coordinates of the feature points is given by

$$P_{n,x} = C_{0,x} + A_n \cdot I, \qquad Q_{n,x} = C_{0,x} + B_n \cdot I,$$
$$P_{n,y} = C_{0,y} + A_n \cdot J, \qquad Q_{n,y} = C_{0,y} + B_n \cdot J$$ and
$$P_{n,z} = C_{0,z} + A_n \cdot K, \qquad Q_{n,z} = C_{0,z} + B_n \cdot K,$$

where $\cdot$ denotes the inner product of two vectors. Finally, the 2-D coordinates of the
feature points projected on the image plane are expressed as

$$p_{n,x} = \frac{1}{E}\,\frac{C_{0,x} + A_n \cdot I}{C_{0,z} + A_n \cdot K}, \qquad q_{n,x} = \frac{1}{E}\,\frac{C_{0,x} + B_n \cdot I}{C_{0,z} + B_n \cdot K}$$ and
$$p_{n,y} = \frac{1}{E}\,\frac{C_{0,y} + A_n \cdot J}{C_{0,z} + A_n \cdot K}, \qquad q_{n,y} = \frac{1}{E}\,\frac{C_{0,y} + B_n \cdot J}{C_{0,z} + B_n \cdot K},$$

where the quantities on the left hand side are in units of pixels while the quantities on the
right hand side, except the E parameter and the unit vectors, are in units of meters. The
above equations can be rewritten with all quantities in units of pixels as follows:

$$p_{n,x} = \frac{c_{0,x} + S_n \cdot I}{\lambda + E\, S_n \cdot K}, \qquad q_{n,x} = \frac{c_{0,x} + L_n \cdot I}{\lambda + E\, L_n \cdot K}$$ and
$$p_{n,y} = \frac{c_{0,y} + S_n \cdot J}{\lambda + E\, S_n \cdot K}, \qquad q_{n,y} = \frac{c_{0,y} + L_n \cdot J}{\lambda + E\, L_n \cdot K},$$

where $W$ is some constant in units of meters that will be defined shortly.
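
The text does not spell out how the pixel-unit quantities relate to the meter-unit ones. One assignment that reproduces the pixel-unit form from the meter-unit form is given below; it is offered as an assumption consistent with the surrounding equations (including the orthographic special case, in which $W = C_{0,z}$ gives $\lambda = 1$), not as the original definition. The derivation is shown for the $x$ coordinate of a global marker and follows by dividing numerator and denominator by $W$.

```latex
% Assumed definitions (not stated explicitly in the text):
%   c_{0,x} = C_{0,x} / (E W),  c_{0,y} = C_{0,y} / (E W),
%   S_n = A_n / (E W),          L_n = B_n / (E W),
%   \lambda = C_{0,z} / W.
\begin{align*}
p_{n,x}
  &= \frac{1}{E}\,\frac{C_{0,x} + A_n \cdot I}{C_{0,z} + A_n \cdot K}
   = \frac{\dfrac{C_{0,x}}{E W} + \dfrac{A_n}{E W} \cdot I}
          {\dfrac{C_{0,z}}{W} + E\,\dfrac{A_n}{E W} \cdot K} \\
  &= \frac{c_{0,x} + S_n \cdot I}{\lambda + E\, S_n \cdot K}.
\end{align*}
```

Under this assumed assignment $c_{0,x}$, $c_{0,y}$, $S_n$, and $L_n$ carry units of pixels, because $1/E = FD$ is in pixels, while $\lambda$ is a dimensionless ratio of the face distance $C_{0,z}$ to the reference distance $W$.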

It is now time to write the above equations for all neutral images. Suppose the
number of neutral images is $F$, then the general equations for 2-D projections of markers
are

$$p^f_{n,x} = \frac{c^f_{0,x} + S_n \cdot I^f}{\lambda^f + E\, S_n \cdot K^f}, \qquad q^f_{n,x} = \frac{c^f_{0,x} + L_n \cdot I^f}{\lambda^f + E\, L_n \cdot K^f}$$ and
$$p^f_{n,y} = \frac{c^f_{0,y} + S_n \cdot J^f}{\lambda^f + E\, S_n \cdot K^f}, \qquad q^f_{n,y} = \frac{c^f_{0,y} + L_n \cdot J^f}{\lambda^f + E\, L_n \cdot K^f},$$

where $f = 1,\ldots,F$, denotes the image number. Note that all quantities in the above
equations vary with the image number, except for the local coordinates of the markers
and of course the E parameter. The parameter $W$ has the same value for all $f = 1,\ldots,F$,
otherwise its value is arbitrary.

The method of "Shape and Motion from Image Streams under Orthography"
assumes a special form of 2-D projection, namely, the orthographic projection. In
orthographic projection, it is assumed that $C_{0,z}$ is the same for all images, $W = C_{0,z}$, and
$W \gg 1$. Thus, the above general equations reduce to the following form in the case of
orthographic projections:

$$p^f_{n,x} = c^f_{0,x} + S_n \cdot I^f, \qquad q^f_{n,x} = c^f_{0,x} + L_n \cdot I^f$$ and
$$p^f_{n,y} = c^f_{0,y} + S_n \cdot J^f, \qquad q^f_{n,y} = c^f_{0,y} + L_n \cdot J^f.$$

The quantities on the left hand side are measured quantities while the quantities on the
right hand side are unknown quantities. The method of "Factorization of Shape and
Motion" solves the above equations for the 3-D local coordinates $S_n$ and $L_n$ of the
global and local markers, respectively, the orientation vectors $I^f$ and $J^f$, and the 2-D
position $(c^f_{0,x}, c^f_{0,y})$ of the centroid of the markers in all images in terms of the 2-D
projected positions $(p^f_{n,x}, p^f_{n,y})$ and $(q^f_{n,x}, q^f_{n,y})$ of the global and local markers,
respectively, in all images.

In the following, a modification to the method of "Shape and Motion from Image
Streams under Orthography" is presented in order to solve the general 2-D projection
equations given above for the 3-D local coordinates $S_n$ and $L_n$ of the markers, the
orientation vectors $I^f$ and $J^f$, the 2-D position $(c^f_{0,x}, c^f_{0,y})$ of the centroid of the
markers, and the distance ratio $\lambda^f$ in all images in terms of the 2-D projected positions
$(p^f_{n,x}, p^f_{n,y})$ and $(q^f_{n,x}, q^f_{n,y})$ of the markers in all images. Note that the third
orientation vector $K^f$ is uniquely defined by the first two orientation vectors $I^f$ and $J^f$
simply as

$$K^f = I^f \times J^f,$$

where $\times$ denotes the vector cross product. The proposed modification method is an
iterative procedure whose steps are as given below:

1. Use the method of "Shape and Motion from Image Streams under
Orthography" that employs the orthographic projection equations to calculate
$S_n$ for $n = 1,\ldots,N$, $L_n$ for $n = 1,\ldots,M$, $I^f$, $J^f$ and $(c^f_{0,x}, c^f_{0,y})$ for
$f = 1,\ldots,F$, given the 2-D measurements $(p^f_{n,x}, p^f_{n,y})$, $(q^f_{n,x}, q^f_{n,y})$ and the
visibility information of the markers. Let $K^f = I^f \times J^f$.

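2. Calculate $\lambda^f$ for $f = 1,\ldots,F$, using the general projection equations as

$$\lambda^f = \frac{1}{N}\sum_{n=1}^{N} \frac{\left\|\left(c^f_{0,x} + S_n \cdot I^f,\; c^f_{0,y} + S_n \cdot J^f\right)\right\| - \left\|\left(p^f_{n,x},\, p^f_{n,y}\right)\right\| E\, S_n \cdot K^f}{\left\|\left(p^f_{n,x},\, p^f_{n,y}\right)\right\|}.$$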



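3. Modify the 2-D measurements $(p^f_{n,x}, p^f_{n,y})$, $n = 1,\ldots,N$, and $(q^f_{n,x}, q^f_{n,y})$,
$n = 1,\ldots,M$, for $f = 1,\ldots,F$, using the calculated values in Steps 1 and 2 as

$$p^f_{n,x} \leftarrow p^f_{n,x}\left(\lambda^f + E\, S_n \cdot K^f\right), \qquad q^f_{n,x} \leftarrow q^f_{n,x}\left(\lambda^f + E\, L_n \cdot K^f\right)$$ and
$$p^f_{n,y} \leftarrow p^f_{n,y}\left(\lambda^f + E\, S_n \cdot K^f\right), \qquad q^f_{n,y} \leftarrow q^f_{n,y}\left(\lambda^f + E\, L_n \cdot K^f\right).$$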

4. Repeat Steps 1, 2, and 3 until a predetermined number of iterations has been
reached, or the following average measurement of matching error

$$\frac{1}{V}\sum_{f}\left[\sum_{n}\left\|\left(p^f_{n,x} - \frac{c^f_{0,x} + S_n \cdot I^f}{\lambda^f + E\, S_n \cdot K^f},\; p^f_{n,y} - \frac{c^f_{0,y} + S_n \cdot J^f}{\lambda^f + E\, S_n \cdot K^f}\right)\right\| + \sum_{n}\left\|\left(q^f_{n,x} - \frac{c^f_{0,x} + L_n \cdot I^f}{\lambda^f + E\, L_n \cdot K^f},\; q^f_{n,y} - \frac{c^f_{0,y} + L_n \cdot J^f}{\lambda^f + E\, L_n \cdot K^f}\right)\right\|\right]$$

goes below a predetermined threshold, where the summation is only over the
visible markers in each image, the quantity $V$ denotes the total number of visible
markers in all images, and $(p^f_{n,x}, p^f_{n,y})$ and $(q^f_{n,x}, q^f_{n,y})$ are the original 2-D
measurements. In a preferred embodiment of the invention, the number of
iterations is selected to be 50 and the threshold is selected to be 1 pixel.
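
The role of Steps 2 to 4 can be sketched in Python for a single image with global markers only. The sketch assumes that Step 1 (the orthographic factorization) has already produced the shape, orientation, and centroid; it does not implement the factorization itself. The particular estimator used for the distance ratio (an average of per-marker solutions of the general projection equation), all function names, and the numeric values are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3-D vectors, used for K = I x J."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def project(c0, s, i_vec, j_vec, k_vec, lam, e_param):
    """General projection of one marker with local coordinates s."""
    depth = lam + e_param * dot(s, k_vec)
    return ((c0[0] + dot(s, i_vec)) / depth, (c0[1] + dot(s, j_vec)) / depth)


def estimate_lambda(c0, markers, pixels, i_vec, j_vec, k_vec, e_param):
    """Assumed Step 2: average the per-marker solutions for the distance ratio."""
    values = []
    for s, p in zip(markers, pixels):
        norm_p = math.hypot(p[0], p[1])
        if norm_p == 0.0:
            continue  # a measurement at the image center gives no scale information
        norm_ortho = math.hypot(c0[0] + dot(s, i_vec), c0[1] + dot(s, j_vec))
        values.append(norm_ortho / norm_p - e_param * dot(s, k_vec))
    return sum(values) / len(values)


def rescale(markers, pixels, k_vec, lam, e_param):
    """Step 3: multiply each measurement by (lambda + E * S.K)."""
    scaled = []
    for s, p in zip(markers, pixels):
        factor = lam + e_param * dot(s, k_vec)
        scaled.append((p[0] * factor, p[1] * factor))
    return scaled


def matching_error(c0, markers, pixels, i_vec, j_vec, k_vec, lam, e_param):
    """Step 4: mean distance in pixels between measured and re-projected markers."""
    total = 0.0
    for s, p in zip(markers, pixels):
        qx, qy = project(c0, s, i_vec, j_vec, k_vec, lam, e_param)
        total += math.hypot(p[0] - qx, p[1] - qy)
    return total / len(markers)


# Hypothetical single-image data; the measurements are synthesized from the model.
I_VEC, J_VEC = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
K_VEC = cross(I_VEC, J_VEC)
C0, LAM, E = (5.0, -3.0), 1.2, 0.01
MARKERS = [(10.0, 0.0, 2.0), (0.0, 20.0, -4.0), (-15.0, 5.0, 1.0)]
PIXELS = [project(C0, s, I_VEC, J_VEC, K_VEC, LAM, E) for s in MARKERS]

lam_est = estimate_lambda(C0, MARKERS, PIXELS, I_VEC, J_VEC, K_VEC, E)
scaled = rescale(MARKERS, PIXELS, K_VEC, lam_est, E)
error = matching_error(C0, MARKERS, PIXELS, I_VEC, J_VEC, K_VEC, lam_est, E)
```

With exact synthetic measurements the estimated ratio matches the value used to generate them, and the rescaled measurements satisfy the orthographic equations (the first rescaled point equals the centroid plus the marker's in-plane offset). That is why the rescaled measurements can be fed back into the orthographic method in the next iteration.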



The 3-D positions $S_n$, $n = 1,\ldots,N$, and $L_n$, $n = 1,\ldots,M$, of the global and local
markers are globally translated and rotated so that they correspond to a frontal-looking
face. In a preferred embodiment of the invention, the 3-D positions of the global markers
right-ear-base 251, left-ear-base 252, nose-base 259, and nose-tip 260 are used to globally
translate and rotate the 3-D positions of the global and local markers so that they
correspond to a frontal-looking face. Let $r_1$ and $r_2$ denote the 3-D positions of the
right-ear-base 251 and left-ear-base 252, respectively; $f$ denote the 3-D position of the
nose-base 259; and $b$ denote the 3-D position of the nose-tip 260. Then, the following
procedure is used to globally translate the positions of the markers:

1. Define the following vector

$$c = \frac{1}{2}\left(r_1 + r_2\right).$$

2. Subtract $c$ from each $S_n$ and $L_n$, i.e.,

$$S_n \leftarrow S_n - c \quad \text{and} \quad L_n \leftarrow L_n - c,$$

so that the center of the feature points is shifted to the mid-point of the
right-ear-base 251 and the left-ear-base 252.

Following the global translation of the markers, in a preferred embodiment of the
invention, the following procedure is used to globally rotate the marker positions so that
they correspond to a frontal-looking face:

1. Define the following three vectors

$$u = \ldots$$

2. Use the Gram-Schmidt orthonormalization procedure to convert the vectors
$u$, $v$, and $w$ into an orthonormal set of vectors. As a result, $u$ simply will be
normalized; only the component of $v$ that is perpendicular to $u$ will be
retained and subsequently normalized; and only the component of $w$ that is
perpendicular to both $u$ and the modified $v$ will be retained and subsequently
normalized.

3. Form the 3x3 rotation matrix $T$ so that the columns of $T$ consist of the
orthonormalized vectors $u$, $v$, $w$, i.e.,

$$T = [\,u \;\; v \;\; w\,].$$

4. Finally, left-multiply each $S_n$ and $L_n$ with $T$, i.e.,

$$S_n \leftarrow T S_n \quad \text{and} \quad L_n \leftarrow T L_n.$$
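
The translation and Gram-Schmidt steps can be sketched in Python as follows. The definitions of the three input vectors are not legible in the text, so the vectors passed to `gram_schmidt` below are hypothetical stand-ins, as are the function names and all numeric values. The `rotate` helper applies a matrix whose columns are the orthonormalized vectors, following the wording of the procedure.

```python
import math


def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def scale(a, s):
    """Vector a multiplied by the scalar s."""
    return tuple(x * s for x in a)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    """Vector a scaled to unit length."""
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def gram_schmidt(u, v, w):
    """Orthonormalize (u, v, w) in that order, as described in the rotation step."""
    u = normalize(u)
    v = normalize(sub(v, scale(u, dot(v, u))))
    w = sub(w, scale(u, dot(w, u)))
    w = normalize(sub(w, scale(v, dot(w, v))))
    return u, v, w


def center_on_ears(points, r1, r2):
    """Shift all points so the mid-point of the two ear-base markers is the origin."""
    c = scale(tuple(x + y for x, y in zip(r1, r2)), 0.5)
    return [sub(p, c) for p in points]


def rotate(points, u, v, w):
    """Left-multiply each point by T = [u v w] (u, v, w are the columns of T)."""
    return [tuple(p[0] * u[k] + p[1] * v[k] + p[2] * w[k] for k in range(3))
            for p in points]


# Hypothetical inputs: three non-orthogonal vectors and one marker position.
U, V, W = gram_schmidt((2.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0))
shifted = center_on_ears([(1.0, 2.0, 3.0)], (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
rotated = rotate(shifted, U, V, W)
```

For these assumed inputs the orthonormalized vectors are the coordinate axes, so the rotation leaves the shifted point unchanged; with vectors built from the ear and nose markers the same functions would align the marker set with a frontal-looking face.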

D3. Calculating The 3-D Positions Of The Local Salient Features For The Action Faces 
(Step 133) 

Given the 3-D positions of the salient features obtained in Step 132, the 2-D
measurements of the salient features obtained in Step 131, and the value of the E
parameter of the camera obtained in Step 110, the method of calculating the 3-D positions
of the local salient features for the action faces is disclosed in the following. In a
preferred embodiment of the present invention, global and local markers placed on the
person's face as described in Step 100 are used as the salient features. First, the position
and orientation of the person's face in the action images are calculated using the 3-D
positions $S_n$ of the global markers and the 2-D measurements $(p^f_{n,x}, p^f_{n,y})$ of the global
markers in the action images. Then, the 3-D positions of the local markers in the
action states are calculated using the position and orientation of the person's face in the
action images and the 2-D measurements $(q^f_{n,x}, q^f_{n,y})$ of the local markers in the action
images.

It facilitates understanding to note that the 3-D position of the face in an image f 
is described by the centroid $(c^f_{0,x}, c^f_{0,y})$ of the markers and the camera-distance-ratio 
$\lambda^f$ of the face in that image. Likewise, the 3-D orientation of the face in an image f is 
described by the vectors $I^f$ and $J^f$ in that image. The 3-D position and orientation 
parameters $(c^f_{0,x}, c^f_{0,y})$, $\lambda^f$, $I^f$ and $J^f$ in the action images are calculated using the 
following steps: 

1. Use the motion-only-estimation method of "Factorization of Shape and 
Motion" that employs the orthographic projection equations to calculate $I^f$, 
$J^f$ and $(c^f_{0,x}, c^f_{0,y})$ in the action images given the 2-D measurements 
$(p^f_{n,x}, p^f_{n,y})$ and the visibility information of the global markers in the 
action images, and the 3-D positions $S_n$ of the markers calculated in Step 132. 
Let $K^f = I^f \times J^f$. 

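2. Calculate $\lambda^f$ using the general projection equations as 

$$\lambda^f = \frac{\sum_n \left( \left\| \left( c^f_{0,x} + S_n \cdot I^f,\; c^f_{0,y} + S_n \cdot J^f \right) \right\| - \left\| \left( p^f_{n,x},\, p^f_{n,y} \right) \right\| E\, S_n \cdot K^f \right)}{\sum_n \left\| \left( p^f_{n,x},\, p^f_{n,y} \right) \right\|}$$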



3. Modify the 2-D measurements $(p^f_{n,x}, p^f_{n,y})$ for n = 1,...,N, using the 
values calculated in Steps 1 and 2 as 

$$p^f_{n,x} \leftarrow p^f_{n,x} \left( \lambda^f + E\, S_n \cdot K^f \right), \quad \text{and}$$

$$p^f_{n,y} \leftarrow p^f_{n,y} \left( \lambda^f + E\, S_n \cdot K^f \right).$$

4. Repeat Steps 1, 2, and 3 until a predetermined number of iterations has been 
reached, or the following average measurement of matching error 

$$\frac{1}{U} \sum_n \left\| \left( p^f_{n,x} - \frac{c^f_{0,x} + S_n \cdot I^f}{\lambda^f + E\, S_n \cdot K^f},\;\; p^f_{n,y} - \frac{c^f_{0,y} + S_n \cdot J^f}{\lambda^f + E\, S_n \cdot K^f} \right) \right\|$$

for the image goes below a predetermined threshold, where the summation is 
only over the visible points in the image, the quantity U denotes the total 
number of visible points in the image, and $(p^f_{n,x}, p^f_{n,y})$ are the original 2-D 
measurements. In a preferred embodiment of the invention, the number of 
iterations is selected to be 50 and the threshold is selected to be 1 pixel. 

The 3-D positions $L_n$ of the local markers for the action faces are then calculated 
using the following steps: 



1. Use the shape-only-estimation method of "Factorization of Shape and 
Motion" that employs the orthographic projection equations to calculate the 
3-D positions $L_n^{(i)}$ of the local markers in action state i given the position 
$(c^f_{0,x}, c^f_{0,y})$ and orientation $I^f$, $J^f$ of the face, the measurements 
$(q^f_{n,x}, q^f_{n,y})$, and the visibility information of the local markers in the 2-D 
images of the action state i. Referring to FIG. 8, in a preferred embodiment of 
the invention, there are 5 action states where action state i=1 corresponds to a 
yawning mouth 241 and 242, action state i=2 corresponds to a smiling mouth 
243 and 244, action state i=3 corresponds to a kissing mouth 245 and 246, 
action state i=4 corresponds to raised eyebrows 247 and 248, and action state 
i=5 corresponds to squeezed eyebrows 249 and 250. 

2. Modify the 2-D measurements $(q^f_{n,x}, q^f_{n,y})$ for n = 1,...,M, and for each 
action state i, using the calculated values in Step 1 as 

$$q^f_{n,x} \leftarrow q^f_{n,x} \left( \lambda^f + E\, L_n^{(i)} \cdot K^f \right), \quad \text{and}$$

$$q^f_{n,y} \leftarrow q^f_{n,y} \left( \lambda^f + E\, L_n^{(i)} \cdot K^f \right).$$



3. Repeat Steps 1 and 2 until a predetermined number of iterations has been 
reached, or the following average measurement of matching error 



for the image goes below a predetermined threshold, where the summation is 
only over the visible points in the image, the quantity U denotes the total 
number of visible points in the image, and $(q^f_{n,x}, q^f_{n,y})$ are the original 2-D 
measurements. In a preferred embodiment of the invention, the number of 
iterations is selected to be 50 and the threshold is selected to be 1 pixel. 

E. Determining The Surface Normals And Motion Planes For The Salient 
Features (Step 140) 

Referring to FIG. 2, in the fifth step, a surface normal is defined for each marker. 
The surface normals are used during the tracking process to determine if a marker is 
visible in a 2-D image. The surface normal for a marker is defined to be the vector 
perpendicular to the surface of the face at the location of the marker. In a preferred 
embodiment of the invention, the vectors given in the table in FIG. 11 are defined as the 
surface normals for the global markers. The surface normals for local markers are given 
in FIG. 12. It should be noted that the surface normals given in the tables in FIGS. 11 and 
12 are not necessarily normalized. They can be normalized so that they all have unit 
length. The normalized surface normals for the global and local markers are denoted by 
$\Omega_n$ for n = 1,...,N, and $\Psi_n$ for n = 1,...,M, respectively. The surface normals for the 
markers are used later in Step 170 to determine the visibilities of the markers in a 2-D 
image. 
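
As a small illustration of the normalization mentioned above, the following Python sketch scales each tabulated normal to unit length. The function name `normalize_normals` and the sample vector are hypothetical; the actual normals are the ones listed in FIGS. 11 and 12.

```python
import math


def normalize_normals(normals):
    """Return unit-length copies of the given 3-D surface normals."""
    unit_normals = []
    for normal in normals:
        length = math.sqrt(sum(c * c for c in normal))
        if length == 0.0:
            raise ValueError("a surface normal must be non-zero")
        unit_normals.append(tuple(c / length for c in normal))
    return unit_normals
```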

F. Receiving A Chronologically Ordered Sequence Of 2-D Images Of The 
Person's Face In Action (Step 150) 

Referring to FIG. 3, in the sixth step 150, a video of the face of the person in 
action is received. The 2-D images of the video are processed, in the order they are 
received, to track the salient features on the face and to calculate the global and local 
motion of the face. 

G. Locking Onto The Selected Features On The Person's Face (Step 160) 

Referring to FIG. 3, in the seventh step 160, a locking method is used to start 
tracking the salient features of the face. The locking method is used at the very beginning 
of the tracking process or whenever the tracking is lost, as described in Step 180. The initial 
images of the video are used to lock the tracking process onto the salient features on the 
face. 

In a preferred embodiment of the invention, cross-like signs are displayed on top 
of the 2-D image to be associated with the markers on the face. The locations of the signs 
are determined by projecting the 3-D positions of the markers obtained in Step 132 
assuming a frontal orientation of the face. To achieve locking, the person looks directly at 
the camera so as to produce a frontal view positioned at the center of the image. The 
person moves his/her face back and forth and also rotates his/her face if necessary until 
all the markers on his/her face are positioned at approximately the same location as the 
associated signs on the display. The method of the present invention considers the 
locations of the cross-like signs as the predicted locations for the features and uses the 
method of Step 173 to calculate the current motion in the 2-D image. If the calculated 
motion corresponds to a frontal orientation at the center of the display, then the method of 
the present invention considers that a lock onto the features has been achieved. 

H. Tracking The 3-D Global And Local Motion Of The Face In Each 2-D Image 
(Step 170) 

Referring to FIG. 4, the method of finding the 3-D global and local motion of the 
face in each 2-D image comprises the steps of (1) predicting the global motion (Step 
171); (2) detecting the global salient features (Step 172); (3) estimating the global motion 
(Step 173); (4) predicting the local motion (Step 174); (5) detecting the local salient 
features (Step 175); and (6) estimating the local motion (Step 176). In the following, a 
detailed description of these steps is given. 

H1. Predicting The Locations of Global Salient Features (Step 171) 

The global motion of the face in a 2-D image is defined to be the 3-D orientation 
and position of the face in the 2-D image. Referring to FIG. 4, in the eighth step, the 
global motion of the face in a 2-D image that is currently processed is predicted from the 
motion of the face in the previously processed 2-D images. In a preferred embodiment of 
the invention, the calculated position and orientation of the face in the immediately 
preceding 2-D image are used as the prediction for the global motion in the current 2-D 
image. Thus, the predicted locations $(p_{n,x}, p_{n,y})$ for n = 1,...,N, of the global markers in 
the current 2-D image are calculated using 

$$p_{n,x} = \frac{c_{0,x} + S_n \cdot I}{\lambda + E\, S_n \cdot K}, \qquad p_{n,y} = \frac{c_{0,y} + S_n \cdot J}{\lambda + E\, S_n \cdot K},$$

where $(c_{0,x}, c_{0,y})$, $\lambda$, I and J denote the global motion parameters found in the 
previous 2-D image, and $K = I \times J$. 
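
The prediction of Step 171 amounts to projecting each 3-D marker position with the previous image's motion parameters. The following Python sketch shows that projection under stated assumptions: the function name `predict_marker_locations`, its argument layout, and the sample numbers used with it are hypothetical, and `lam` stands for the camera-distance ratio.

```python
def dot(a, b):
    """Inner product of two 3-D vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3-D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def predict_marker_locations(markers, c0, lam, I, J, E):
    """Project 3-D marker positions S_n to predicted 2-D locations.

    markers: list of 3-D positions S_n
    c0:      (c0x, c0y) centroid from the previous image
    lam:     camera-distance ratio from the previous image
    I, J:    orientation vectors from the previous image
    E:       camera parameter obtained during calibration
    """
    K = cross(I, J)
    predicted = []
    for S in markers:
        denom = lam + E * dot(S, K)
        predicted.append(((c0[0] + dot(S, I)) / denom,
                          (c0[1] + dot(S, J)) / denom))
    return predicted
```

The same function can be reused for the local markers of Step 174 by passing the predicted 3-D local marker positions instead of the global ones.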



H2. Detecting The Global Salient Features (Step 172) 



The method of detecting the global markers in the current 2-D image is comprised 
of the following steps: 

1. Determine the visibility indices of the global markers: Calculate the visibility 
index $\omega_n$ for each global marker: 



It is important to note that the closer the value of the index $\omega_n$ is to 1, the more 
visible is the global marker. 

2. Design correlation filters for detecting the markers: It is important to note that 
the two concentric circles that form a global marker will appear like two 
concentric ellipses in the current 2-D image. The minor axis of the ellipse will 
be in the direction of the vector $(I \cdot \Omega_n,\; J \cdot \Omega_n)$, and the length of the minor 
axis will be $|K \cdot \Omega_n|\, R\, \sigma_n$ while the length of the major axis will be $R\, \sigma_n$, 
where R is the diameter of the outer circle in units of pixels and $\sigma_n$ is given by 

$$\sigma_n = \frac{1}{\lambda + E\, S_n \cdot K}.$$

Thus, in order to detect global marker n in the current 2-D image, a 2-D 
correlation filter is designed that has the support given by the outer ellipse and 
has the value of 1 inside the inner ellipse and the value of 0 elsewhere. Let 
the coefficients of the 2-D correlation filter for the global marker n be given 
by $c_n(x, y)$. 

3. Detect the global markers: If the visibility index $\omega_n$ of global marker n is 
larger than a visibility threshold, then apply the correlation filter $c_n(x, y)$ 
designed in Step 2 for the global marker n in a W x W square region centered 
at the predicted location $(p_{n,x}, p_{n,y})$ of the global marker n to obtain a 
correlation surface $f_n(i, j)$ for the global marker n: 

$$f_n(i, j) = \sum_{x, y} c_n(x, y)\, I(x + i + p_{n,x},\; y + j + p_{n,y}), \qquad -\frac{W}{2} \le i, j \le \frac{W}{2},$$

where the summation is over the support of the correlation filter $c_n(x, y)$ and 
$I(x, y)$ denotes the intensity distribution of the 2-D image with the center of 
the image being at (0,0). In a preferred embodiment of the invention, the 
visibility threshold is selected as 0.25 and the size W of the square region is 
selected as 20 pixels. Find the location $(i_n^*, j_n^*)$ where the correlation surface 
$f_n(i, j)$ achieves its peak value. Then, the image location 
$(x_n, y_n) = (i_n^* + p_{n,x},\; j_n^* + p_{n,y})$ is assigned as the detected location of the 
global marker n in the current 2-D image. Let $Q_n$ denote this peak value. 

4. Eliminate superfluous and multiple detected locations: If the distance between 
any two detected locations is less than a distance threshold, but larger than 
zero, then discard the detected location that has a smaller peak value. On the 
other hand, if the exact same location is detected for more than one global 
marker, then assign the detected location only to the global marker that has the 
largest visibility index. In a preferred embodiment of the invention, the 
distance threshold is selected to be 1 pixel. All global markers that are not 
assigned a valid detected location are assumed invisible for the purpose of 
estimating the global motion that is done in the following Step 173. 
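
The peak search of Step 3 can be sketched as a brute-force correlation over the W x W window. This is an illustrative sketch only: the name `detect_marker`, the dictionary-based image and filter representation, the treatment of pixels outside the image as zero, and the use of integer predicted locations are all assumptions, not details given in the text.

```python
def detect_marker(image, filt, predicted, W=20):
    """Find the peak of the correlation surface around a predicted location.

    image:     dict mapping integer (x, y) to intensity, image center at (0, 0);
               missing pixels are treated as 0 (an assumption of this sketch)
    filt:      dict mapping integer (x, y) offsets in the filter support
               to filter coefficients
    predicted: integer (px, py) predicted marker location
    W:         side of the square search region in pixels
    Returns ((x, y) detected location, peak correlation value).
    """
    px, py = predicted
    half = W // 2
    best = None
    for i in range(-half, half + 1):
        for j in range(-half, half + 1):
            # Correlation value f(i, j): sum over the filter support.
            score = sum(
                coeff * image.get((x + i + px, y + j + py), 0.0)
                for (x, y), coeff in filt.items()
            )
            if best is None or score > best[0]:
                best = (score, i, j)
    peak, i_star, j_star = best
    return (i_star + px, j_star + py), peak
```

A production implementation would more likely use an array library and skip markers whose visibility index is below the threshold before calling such a routine.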

H3. Estimating The Global Motion (Step 173) 

Suppose, at the end of Step 172, there are L valid detected locations assigned to L 
global markers. The 3-D orientation I and J, and the 3-D position $(c_{0,x}, c_{0,y})$, $\lambda$, of the 
face in the current 2-D image are then calculated from these L detected locations using 
the following steps: 

1. Use the motion-only-estimation method of "Factorization of Shape and 
Motion" that employs the orthographic projection equations to calculate I, J 
and $(c_{0,x}, c_{0,y})$ given the 2-D locations $(x_n, y_n)$ and the visibility information 
of the global markers in the action images, and the 3-D positions $S_n$ of the 
markers calculated in Step 132. Let $K = I \times J$. 

2. Calculate $\lambda$ using the general projection equations as 



where the summation is only over the visible global markers. 

3. Modify the 2-D locations $(x_n, y_n)$ using the values calculated in Steps 1 and 2 
as 

$$x_n \leftarrow x_n \left( \lambda + E\, S_n \cdot K \right), \quad \text{and}$$

$$y_n \leftarrow y_n \left( \lambda + E\, S_n \cdot K \right).$$

4. Repeat Steps 1, 2, and 3 until a predetermined number of iterations has been 
reached, or the following average measurement of matching error 

$$\varepsilon = \frac{1}{L} \sum_n \left\| \left( x_n - \frac{c_{0,x} + S_n \cdot I}{\lambda + E\, S_n \cdot K},\;\; y_n - \frac{c_{0,y} + S_n \cdot J}{\lambda + E\, S_n \cdot K} \right) \right\|$$

goes below a predetermined threshold, where the summation is only over the 
visible global markers. In a preferred embodiment of the invention, the 
number of iterations is selected to be 50 and the threshold is selected to be 1 
pixel. 
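
Steps 3 and 4 of the global motion estimate can be sketched as two small helpers. This is a hedged illustration: the names `to_orthographic` and `matching_error` are hypothetical, the orientation, centroid, and camera-distance ratio are assumed to come from Steps 1 and 2 (which are not implemented here), and the error is taken to be the mean Euclidean distance between detected and re-projected locations.

```python
import math


def dot(a, b):
    """Inner product of two 3-D vectors."""
    return sum(x * y for x, y in zip(a, b))


def to_orthographic(points, markers, lam, K, E):
    """Step 3: scale each detected 2-D location by (lam + E * S_n . K)."""
    scaled = []
    for (x, y), S in zip(points, markers):
        s = lam + E * dot(S, K)
        scaled.append((x * s, y * s))
    return scaled


def matching_error(points, markers, c0, lam, I, J, K, E):
    """Step 4: mean distance between detections and re-projected markers.

    points and markers must list the visible markers only, in the same order;
    points are the original (unmodified) detected locations.
    """
    total = 0.0
    for (x, y), S in zip(points, markers):
        s = lam + E * dot(S, K)
        ex = x - (c0[0] + dot(S, I)) / s
        ey = y - (c0[1] + dot(S, J)) / s
        total += math.hypot(ex, ey)
    return total / len(points)
```

In an iteration loop, the estimate would be refined until `matching_error` drops below the chosen threshold (1 pixel in the preferred embodiment) or the iteration limit (50) is reached.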

H4. Predicting The Locations of Local Salient Features (Step 174) 

The local motion of the face in a 2-D image is defined through an action vector that 
represents the actions of the face in the 2-D image. In a preferred embodiment of the 
invention there are a total of 5 actions, hence the action vector has 5 components: 

$$A = (A_{MY}, A_{MS}, A_{MK}, A_{ER}, A_{ES})$$

$A_{MY}$ being the amount of yawning-mouth action, $A_{MS}$ being the amount of smiling-mouth 
action, $A_{MK}$ being the amount of kissing-mouth action, $A_{ER}$ being the amount of 
raised-eyebrows action, and $A_{ES}$ being the amount of squeezed-eyebrows action. For 
example, an action vector A = (0.5, 0.0, 0.0, 1.0, 0.0) represents a half-yawning mouth 
and fully raised eyebrows. 

As mentioned in Step 133, there are 5 action states. It facilitates understanding to give 
examples of action vectors for the action states. Action state i=1 that corresponds to a 
yawning mouth has action vector A = (1.0, 0.0, 0.0, 0.0, 0.0) while action state i=5 that 
corresponds to squeezed eyebrows has action vector A = (0.0, 0.0, 0.0, 0.0, 1.0). The 
neutral state of the face is represented by the action vector A = (0.0, 0.0, 0.0, 0.0, 0.0). 

During the locking process explained in Step 160, the face is in the neutral state, hence 
the action vector is given by A = (0.0, 0.0, 0.0, 0.0, 0.0). In any subsequent 2-D image 
of the face, the action vector found for the previous image is used as the predicted action 
vector for the current image. Let A denote the predicted action vector for the current 
image. Let 

$$\bar{L}_n = \left( L_n^{(1)} - L_n,\; L_n^{(2)} - L_n,\; L_n^{(3)} - L_n,\; L_n^{(4)} - L_n,\; L_n^{(5)} - L_n \right)$$

denote the action displacement vector for local marker n. Then, the predicted 3-D 
positions $\hat{L}_n$ of the local markers in the current image are calculated as follows: 

$$\hat{L}_n = A \cdot \bar{L}_n + L_n.$$

Finally, the predicted 2-D locations $(q_{n,x}, q_{n,y})$ of the local markers in the current image 
are calculated using 

$$q_{n,x} = \frac{c_{0,x} + \hat{L}_n \cdot I}{\lambda + E\, \hat{L}_n \cdot K}, \qquad q_{n,y} = \frac{c_{0,y} + \hat{L}_n \cdot J}{\lambda + E\, \hat{L}_n \cdot K},$$

where $(c_{0,x}, c_{0,y})$, $\lambda$, I and J denote the global motion parameters found in the 
previous 2-D image, and $K = I \times J$. 
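
The predicted 3-D position of a local marker is its neutral position plus a weighted sum of its displacements toward each action state. The sketch below illustrates that blend; the function name `predict_local_marker` and the sample coordinates used with it are hypothetical.

```python
def predict_local_marker(neutral, action_states, action):
    """Blend action-state displacements into a predicted 3-D position.

    neutral:       neutral 3-D position of the marker
    action_states: list of 3-D positions of the marker, one per action state
    action:        action vector, one weight per action state
    """
    if len(action_states) != len(action):
        raise ValueError("one weight is needed per action state")
    predicted = list(neutral)
    for weight, state in zip(action, action_states):
        for k in range(3):
            # Add the weighted displacement from neutral toward this state.
            predicted[k] += weight * (state[k] - neutral[k])
    return predicted
```

With the neutral action vector (all zeros) the function returns the neutral position unchanged, which matches the locking stage described above.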

H5. Detecting The Local Salient Features (Step 175) 

The method of detecting the local markers in the current 2-D image is comprised 
of the following steps: 

1. Determine the visibility indices of the local markers: Calculate the visibility 
index $\psi_n$ for each local marker: 



It is important to note that the closer the value of the index $\psi_n$ is to 1, the more 
visible is the local marker. 

2. Design correlation filters for detecting the markers: It is important to note that 
the two concentric circles that form a local marker will appear like two 
concentric ellipses in the current 2-D image. The minor axis of the ellipse will 
be in the direction of the vector $(I \cdot \Psi_n,\; J \cdot \Psi_n)$, and the length of the minor 
axis will be $|K \cdot \Psi_n|\, R\, \mu_n$ while the length of the major axis will be $R\, \mu_n$, 
where R is the diameter of the outer circle in units of pixels and $\mu_n$ is given by 

$$\mu_n = \frac{1}{\lambda + E\, L_n \cdot K}.$$

Thus, in order to detect local marker n in the current 2-D image, a 2-D 
correlation filter is designed that has the support given by the outer ellipse and 
has the value of 0 inside the inner ellipse, the value of 1 in the outer 
ellipse, and the value of 0 elsewhere. Let the coefficients of the 2-D 
correlation filter for the local marker n be given by $d_n(x, y)$. 

3. Detect the local markers: If the visibility index $\psi_n$ of local marker n is larger 
than a visibility threshold, then apply the correlation filter $d_n(x, y)$ designed 
in Step 2 for local marker n in a W x W square region centered at the 
predicted location $(q_{n,x}, q_{n,y})$ of local marker n to obtain a correlation surface 
$h_n(i, j)$ for local marker n: 

$$h_n(i, j) = \sum_{x, y} d_n(x, y)\, I(x + i + q_{n,x},\; y + j + q_{n,y}), \qquad -\frac{W}{2} \le i, j \le \frac{W}{2},$$

where the summation is over the support of the correlation filter $d_n(x, y)$ and 
$I(x, y)$ denotes the intensity distribution of the 2-D image with the center of 
the image being at (0,0). In a preferred embodiment of the invention, the 
visibility threshold is selected as 0.25 and the size W of the square region is 
selected as 20 pixels. Find the location $(i_n^*, j_n^*)$ where the correlation surface 
$h_n(i, j)$ achieves its peak value. Then, the image location 
$(u_n, v_n) = (i_n^* + q_{n,x},\; j_n^* + q_{n,y})$ is assigned as the detected location of the local 
marker n in the current 2-D image. Let $Q_n$ denote this peak value. 

4. Eliminate superfluous and multiple detected locations: If the distance between 
any two detected locations is less than a distance threshold, but larger than 
zero, then discard the detected location that has a smaller peak value. On the 
other hand, if the exact same location is detected for more than one local 
marker, then assign the detected location only to the local marker that has the 
largest visibility index. In a preferred embodiment of the invention, the 
distance threshold is selected to be 1 pixel. All local markers that are not 
assigned a valid detected location are assumed invisible for the purpose of 
estimating the local motion that is done in the following Step 176. 

H6. Estimating The Local Motion (Step 176) 

The local motion of the face is represented by an action vector as described in 
Step 174. In a first preferred embodiment of the invention, the action vector for the 
current image is calculated using the following steps: 

1. Calculate the 2-D displacements of the local markers: The 2-D locations 
$(q_{n,x}, q_{n,y})$ of the local markers corresponding to the neutral face are 
calculated using the global motion found for the current image as: 

$$q_{n,x} = \frac{c_{0,x} + L_n \cdot I}{\lambda + E\, L_n \cdot K}, \qquad q_{n,y} = \frac{c_{0,y} + L_n \cdot J}{\lambda + E\, L_n \cdot K}.$$

The 2-D displacements $(d_{n,x}, d_{n,y})$ are then calculated as 

$$d_{n,x} = u_n - q_{n,x}, \qquad d_{n,y} = v_n - q_{n,y}.$$

2. Modify the 2-D displacements so that they correspond to orthographic 
projection: 

$$d_{n,x} \leftarrow d_{n,x} \left( \lambda + E\, L_n \cdot K \right), \quad \text{and}$$

$$d_{n,y} \leftarrow d_{n,y} \left( \lambda + E\, L_n \cdot K \right).$$

3. Calculate the 3-D displacements of the local markers: The 3-D displacements 
of the local markers are calculated from the 2-D displacements of the local 
markers, the 2-D motion planes of the local markers, and the global motion of 
the face in the current image. The 2-D motion plane of a local marker passes 
through the neutral 3-D position of the local marker and approximates the motion 
space of the local marker with a plane. Two basis vectors are used to define each 
motion plane. Let $B_{1,n}$ and $B_{2,n}$ denote the basis vectors 1 and 2 for the local 
marker n. The basis vectors for the motion planes of the local markers are 
given in FIG. 12. The 3-D displacements of the local markers are then 
calculated as follows. Form the matrix $M_n$ for each local marker 

$$M_n = \begin{bmatrix} B_{1,n} \cdot I & B_{2,n} \cdot I \\ B_{1,n} \cdot J & B_{2,n} \cdot J \end{bmatrix}$$

and solve for the coefficients $a_{1,n}$ and $a_{2,n}$ in 

$$M_n \begin{bmatrix} a_{1,n} \\ a_{2,n} \end{bmatrix} = \begin{bmatrix} d_{n,x} \\ d_{n,y} \end{bmatrix}.$$

Then, the 3-D displacements $U_n$ are given by 

$$U_n = a_{1,n} B_{1,n} + a_{2,n} B_{2,n}.$$

Once the 3-D moved positions of the markers are calculated they can be modified 
so as to satisfy the motion symmetries of the face. Examples of motion symmetries of the 
face are as follows: the right and the left eyebrows move simultaneously and by the same 
amount, and the right and the left corners of the mouth move simultaneously and by the 
same amount. 

The calculated 3-D displacements of the markers can be further modified to 
enforce motion dependencies of the face. An example of a motion dependency of the face 
is as follows: as the corners of the mouth move towards the center of the mouth, the 
centers of the top and bottom lips move forward. 

The calculated 3-D displacements of the markers can be still further modified by 
filtering. Filtering the calculated 3-D displacements smooths out the 
jitter in the calculated 3-D positions that can be caused by errors in the detected 2-D 
positions of the markers. 

4. Finally, the action vector $(a_1, \ldots, a_M)$ for the current image is calculated by 
solving the following equation for $(a_1, \ldots, a_M)$ in the least-squares sense: 

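
Step 3 of the first embodiment reduces to solving a 2x2 linear system per local marker. The following Python sketch illustrates that single step under stated assumptions: the name `displacement_3d` is hypothetical, the 2-D displacement is assumed to have already been corrected as in Step 2, and the system is solved with Cramer's rule. The least-squares solution of Step 4 is not covered.

```python
def dot(a, b):
    """Inner product of two 3-D vectors."""
    return sum(x * y for x, y in zip(a, b))


def displacement_3d(d, B1, B2, I, J):
    """Recover a 3-D displacement lying in the motion plane of a marker.

    d:      (dx, dy) orthographic-corrected 2-D displacement
    B1, B2: basis vectors of the marker's motion plane
    I, J:   orientation vectors of the face in the current image
    """
    # Entries of the 2x2 matrix that projects plane coordinates to the image.
    m11, m12 = dot(B1, I), dot(B2, I)
    m21, m22 = dot(B1, J), dot(B2, J)
    det = m11 * m22 - m12 * m21
    if abs(det) < 1e-12:
        raise ValueError("motion plane is seen edge-on; system is singular")
    a1 = (d[0] * m22 - m12 * d[1]) / det
    a2 = (m11 * d[1] - m21 * d[0]) / det
    return [a1 * B1[k] + a2 * B2[k] for k in range(3)]
```

The singular case arises when the motion plane projects to a line in the image; how that case should be handled is not stated in the text, so the sketch simply raises an error.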



In a second preferred embodiment of the invention, the action vector for the current 
image is calculated using the following steps: 

1. Calculate the 2-D locations $(q_{n,x}, q_{n,y})$ of the local markers corresponding to 
the neutral face using the global motion found for the current image as: 

$$q_{n,x} = \frac{c_{0,x} + L_n \cdot I}{\lambda + E\, L_n \cdot K}, \qquad q_{n,y} = \frac{c_{0,y} + L_n \cdot J}{\lambda + E\, L_n \cdot K}.$$

Still referring to FIGS. 5 and 9, local marker n=1 corresponds to 
Right-lip-corner 261, n=2 corresponds to Left-lip-corner 262, n=3 corresponds to 
Upper-lip-center 263, n=4 corresponds to Lower-lip-center 264, n=5 
corresponds to Right-central-eyebrow 265, and n=6 corresponds to 
Left-central-eyebrow 266. 

Also calculate the 2-D location $(p_{9,x}, p_{9,y})$ of the global marker n=9 
Nose-base 259 using the global motion found for the current image as: 

$$p_{n,x} = \frac{c_{0,x} + S_n \cdot I}{\lambda + E\, S_n \cdot K}, \qquad p_{n,y} = \frac{c_{0,y} + S_n \cdot J}{\lambda + E\, S_n \cdot K},$$

where n is set to 9 for the global marker Nose-base 259. 

2. Calculate the 2-D locations $(q^{(i)}_{n,x}, q^{(i)}_{n,y})$ of the local markers corresponding 
to the action states of the face using the global motion found for the current 
image as: 

$$q^{(i)}_{n,x} = \frac{c_{0,x} + L^{(i)}_n \cdot I}{\lambda + E\, L^{(i)}_n \cdot K}, \qquad q^{(i)}_{n,y} = \frac{c_{0,y} + L^{(i)}_n \cdot J}{\lambda + E\, L^{(i)}_n \cdot K},$$

where i is the index of the facial action. Still referring to FIG. 8, in a preferred 
embodiment of the invention, there are 5 action states where action state i=1 
corresponds to a yawning mouth 241 and 242, action state i=2 corresponds to 
a smiling mouth 243 and 244, action state i=3 corresponds to a kissing mouth 
245 and 246, action state i=4 corresponds to raised eyebrows 247 and 248, and 
action state i=5 corresponds to squeezed eyebrows 249 and 250. 

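3. Determine the fractional displacements $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ for yawning-mouth, 
smiling-mouth, and kissing-mouth actions, respectively, as follows. 
The fractional displacement $f^{(1)}$ is determined based on the distance between 
the Upper-lip-center 263 and Lower-lip-center 264 in the yawning-mouth 
action state i=1 of the face, and in the neutral state of the face, and the 
distance between the detected positions of those markers: 

$$f^{(1)} = \frac{\left\| \left( u_3 - u_4,\; v_3 - v_4 \right) \right\| - \left\| \left( q_{3,x} - q_{4,x},\; q_{3,y} - q_{4,y} \right) \right\|}{\left\| \left( q^{(1)}_{3,x} - q^{(1)}_{4,x},\; q^{(1)}_{3,y} - q^{(1)}_{4,y} \right) \right\| - \left\| \left( q_{3,x} - q_{4,x},\; q_{3,y} - q_{4,y} \right) \right\|}$$
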

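The fractional displacement $f^{(2)}$ is determined based on the distance between the 
Right-lip-corner 261 and Left-lip-corner 262 in the smiling-mouth action state i=2 
of the face, and in the neutral state of the face, and the distance between the 
detected positions of those markers: 

$$f^{(2)} = \frac{\left\| \left( u_1 - u_2,\; v_1 - v_2 \right) \right\| - \left\| \left( q_{1,x} - q_{2,x},\; q_{1,y} - q_{2,y} \right) \right\|}{\left\| \left( q^{(2)}_{1,x} - q^{(2)}_{2,x},\; q^{(2)}_{1,y} - q^{(2)}_{2,y} \right) \right\| - \left\| \left( q_{1,x} - q_{2,x},\; q_{1,y} - q_{2,y} \right) \right\|}$$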

Finally, the fractional displacement $f^{(3)}$ is determined based on the distance 
between the Right-lip-corner 261 and Left-lip-corner 262 in the kissing-mouth 
action state i=3 of the face, and in the neutral state of the face, and the distance 
between the detected positions of those markers: 



Then, clip the values of the fractional displacements $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ to the 
range [0,1] and use the following method to determine the first three components 
of the action vector $(a_1, \ldots, a_5)$: 

• If $f^{(3)} > f^{(2)}$ and $f^{(3)} > f^{(1)}$ then $a_1 = 0$, $a_2 = 0$, and $a_3 = f^{(3)}$. 

• Otherwise, if $f^{(2)} > f^{(3)}$ and $f^{(2)} > f^{(1)}$ then $a_1 = 0$, $a_2 = f^{(2)}$, and 
$a_3 = 0$. 

• Otherwise, $a_1 = f^{(1)}$, $a_2 = 0$, and $a_3 = 0$. 

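4. Determine the fractional displacements $f^{(4)}$ and $f^{(5)}$ for raised-eyebrows 
and squeezed-eyebrows, respectively, as follows. The fractional displacement 
$f^{(4)}$ is determined based on the distance between the local markers 
Right-central-eyebrow 265, Left-central-eyebrow 266, and the global marker 
Nose-base 259 in the raised-eyebrows action state i=4 of the face, and in the neutral 
state of the face, and the distance between the detected positions of those 
markers: 

$$f^{(4)} = \frac{\left\| \left( \tfrac{u_5 + u_6}{2} - x_9,\; \tfrac{v_5 + v_6}{2} - y_9 \right) \right\| - \left\| \left( \tfrac{q_{5,x} + q_{6,x}}{2} - p_{9,x},\; \tfrac{q_{5,y} + q_{6,y}}{2} - p_{9,y} \right) \right\|}{\left\| \left( \tfrac{q^{(4)}_{5,x} + q^{(4)}_{6,x}}{2} - p_{9,x},\; \tfrac{q^{(4)}_{5,y} + q^{(4)}_{6,y}}{2} - p_{9,y} \right) \right\| - \left\| \left( \tfrac{q_{5,x} + q_{6,x}}{2} - p_{9,x},\; \tfrac{q_{5,y} + q_{6,y}}{2} - p_{9,y} \right) \right\|}$$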

The fractional displacement $f^{(5)}$ is determined based on the distance between the 
Right-central-eyebrow 265 and Left-central-eyebrow 266 in the squeezed-eyebrows 
action state i=5 of the face, and in the neutral state of the face, and the 
distance between the detected positions of those markers: 



Then, clip the values of the fractional displacements $f^{(4)}$ and $f^{(5)}$ to the range 
[0,1] and use the following method to determine the last two components of the 
action vector $(a_1, \ldots, a_5)$: 

• If $f^{(5)} > f^{(4)}$ then $a_4 = 0$ and $a_5 = f^{(5)}$. 

• Otherwise, $a_4 = f^{(4)}$ and $a_5 = 0$. 
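
The selection rules of the second embodiment can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, `fractional_displacement` assumes the fraction has the form (detected distance minus neutral distance) divided by (action-state distance minus neutral distance), and returning 0 when the action-state and neutral distances coincide is a choice made only for this sketch.

```python
def fractional_displacement(detected, neutral, action):
    """Fraction of the way from the neutral to the action-state distance."""
    span = action - neutral
    if span == 0.0:
        return 0.0  # assumption: no measurable action range
    return (detected - neutral) / span


def clip01(value):
    """Clip a value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def mouth_components(f1, f2, f3):
    """First three action-vector components: keep only the dominant action."""
    f1, f2, f3 = clip01(f1), clip01(f2), clip01(f3)
    if f3 > f2 and f3 > f1:
        return (0.0, 0.0, f3)
    if f2 > f3 and f2 > f1:
        return (0.0, f2, 0.0)
    return (f1, 0.0, 0.0)


def eyebrow_components(f4, f5):
    """Last two action-vector components: keep only the dominant action."""
    f4, f5 = clip01(f4), clip01(f5)
    if f5 > f4:
        return (0.0, f5)
    return (f4, 0.0)


def action_vector(f):
    """Build the 5-component action vector from f = (f1, f2, f3, f4, f5)."""
    return mouth_components(f[0], f[1], f[2]) + eyebrow_components(f[3], f[4])
```

Because each group keeps only its largest fraction, at most one mouth action and one eyebrow action are non-zero in the resulting vector.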

I. Determining If There Is A Tracking Failure (Step 180) 

If there is a large change in the global motion of the face in the current image as 
compared to the global motion of the face in the previous image, then it is concluded that 
there is a tracking failure. In a preferred embodiment of the invention, the following 
equation is used to calculate the change $\delta$ in the global motion in the current image with 
respect to the previous image: 

$$\delta = \left( (c^f_{0,x} - c^{f-1}_{0,x})^2 + (c^f_{0,y} - c^{f-1}_{0,y})^2 + (\lambda^f - \lambda^{f-1})^2 / E^2 \right)^{1/2} + 100 \cdot \left( 1 - |I^f \cdot I^{f-1}| + 1 - |J^f \cdot J^{f-1}| + 1 - |K^f \cdot K^{f-1}| \right)$$

In a preferred embodiment of the invention, if $\delta$ is greater than 50 then it is concluded 
that there is a tracking failure. 
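
The failure test can be sketched as a comparison of two consecutive global motion estimates. The sketch is hedged in two ways: the names `global_motion_change` and `is_tracking_failure` and the dictionary layout are hypothetical, and the formula assumes that the positional terms are combined under a square root and that the rotational terms are weighted by 100, which is one reading of the change measure.

```python
import math


def dot(a, b):
    """Inner product of two 3-D vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3-D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def global_motion_change(cur, prev, E):
    """Change measure between two poses.

    cur, prev: dicts with keys "c0" (2-tuple), "lam" (float),
               "I" and "J" (unit 3-D vectors); a layout assumed for this sketch
    """
    dcx = cur["c0"][0] - prev["c0"][0]
    dcy = cur["c0"][1] - prev["c0"][1]
    dlam = cur["lam"] - prev["lam"]
    translation = math.sqrt(dcx ** 2 + dcy ** 2 + (dlam / E) ** 2)
    K_cur = cross(cur["I"], cur["J"])
    K_prev = cross(prev["I"], prev["J"])
    rotation = (
        (1.0 - abs(dot(cur["I"], prev["I"])))
        + (1.0 - abs(dot(cur["J"], prev["J"])))
        + (1.0 - abs(dot(K_cur, K_prev)))
    )
    return translation + 100.0 * rotation


def is_tracking_failure(cur, prev, E, threshold=50.0):
    """True when the change exceeds the threshold (50 in the preferred embodiment)."""
    return global_motion_change(cur, prev, E) > threshold
```

When a failure is reported, the tracker would fall back to the locking procedure before continuing.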

J. Storing Or Transmitting Global And Local Facial Motion Values (Step 190) 

The calculated global motion of the face is in terms of the 3-D orientation vectors 
$I^f$ and $J^f$, the 2-D centroid $(c^f_{0,x}, c^f_{0,y})$ of the face, and the camera-distance ratio $\lambda^f$. 
The superscript f denotes the chronological order number for the motion values. The 
following equations are used to convert the calculated global motion parameters into a 
more direct representation that uses a 3-D rotation matrix $R^f$ and a 3-D position vector 
$T^f$: 

$$R^f = \begin{bmatrix} I^f_x & I^f_y & I^f_z \\ J^f_x & J^f_y & J^f_z \\ K^f_x & K^f_y & K^f_z \end{bmatrix}, \qquad T^f = $$

where $K^f = I^f \times J^f$, and the subscripts x, y, and z denote the x-, y-, and z-components 
of a vector. 

Thus, only the global motion parameters $R^f$ and $T^f$ are stored or transmitted to 
describe the global motion of the face. Likewise, only the action vectors $A^f$ are stored or 
transmitted to describe the local motion of the face. 

What is claimed is: 

1. A method for tracking motion of a face comprising the steps of: 
selecting salient features of the face for motion tracking; and 
tracking motion of the salient features of the face. 

2. The method of claim 1 further comprising: 
acquiring a plurality of initial 2-D images of the face; 
calculating 3-D locations of the salient features; and 
determining a surface normal for each salient feature. 

3. The method of claim 1 further comprising: 
receiving a chronologically ordered sequence of 2-D images of the face in action; and 
locking onto the salient features. 

4. The method of claim 1 further comprising: 
tracking 3-D global motion of the face in each image; and 
tracking 3-D local motion of the face in each image. 

5. The method of claim 1 wherein the step of selecting comprises fixing markers to the 
face and the step of tracking comprises tracking the motion of the markers. 

6. The method of claim 5 wherein a first set of markers identifies global motion and a 
second set of markers identifies local motion of the face. 

7. The method of claim 4 wherein the step of tracking the 3-D global motion comprises 
the steps of: 
predicting the location of global salient features in a 2-D image; 
detecting global salient features in the 2-D image; and 
estimating the 3-D global motion of the face in the 2-D image. 

8. The method of claim 7 wherein the step of estimating comprises calculating the 
position and shape of the face to conform to the 3-D locations and the detected 
locations of the global markers under a perspective projection model. 

9. The method of claim 4 wherein the step of tracking the 3-D local motion comprises 
the steps of: 
predicting the location of local salient features; 
detecting local salient features; and 
estimating the 3-D local motion of the face. 

10. The method of claim 9 wherein the step of estimating comprises: 
finding 3-D locations of local markers to conform to the detected 2-D locations of the 
local markers; and 
calculating an action vector representing the weights of facial actions in the 2-D 
image conforming to the found 3-D locations of local markers and the 3-D locations 
of the local markers for the neutral and the action states under a perspective projection 
model. 

11. A method for tracking motion of a face comprising the steps of: 
determining the calibration parameter of a camera; 
selecting salient features on the face for motion tracking; 
acquiring a plurality of initial 2-D images of the face; 
calculating 3-D locations of the salient features in accordance with the calibration 
parameter of the camera; 
determining a surface normal for each salient feature; 
receiving a chronologically ordered sequence of 2-D images of the face in action; 
tracking motion of the face in each 2-D image; and 
storing or transmitting tracked motion of the face. 

12. The method of claim 11 further comprising the steps of: 
locking onto the salient features; and 
detecting loss of lock and hence the need for re-locking onto the salient features. 

13. The method of claim 11 wherein the step of tracking comprises the steps of: 
tracking the 3-D global motion of the face in each image; and 
tracking the 3-D local motion of the face in each image. 

14. The method of claim 11 comprising the further step of repeating the locking and 
tracking steps after the detecting step. 

15. The method of claim 11 wherein the step of selecting comprises recognizing salient 
facial features and the step of tracking comprises tracking the motion of the salient 
facial features. 

16. The method of claim 11 wherein the step of selecting comprises fixing markers to the 
face and the step of tracking comprises tracking the motion of the markers. 

17. The method of claim 16 wherein a first set of markers identifies global motion and a 
second set of markers identifies local motion of the face. 

18. The method of claim 17 wherein the markers comprise at least two colors. 

19. The method of claim 18 wherein the two colors are contrasting. 

20. The method of claim 19 wherein the colors are black and white. 

21. The method of claim 16 wherein the markers comprise two concentric circles of 
different colors. 

22. The method of claim 21 wherein the outer circle has a diameter at least twice the 
diameter of the inner circle. 

23. The method of claim 11 wherein the step of selecting comprises wearing a head-set 
with markers. 

24. The method of claim 23 wherein the head-set comprises a strap for a chin. 

25. The method of claim 23 wherein the head-set comprises a strap for eyebrows. 

26. The method of claim 23 wherein the head-set comprises at least one strap for a skull. 

27. The method of claim 11 wherein the acquired 2-D images include at least two views 
of the face with markers in a neutral state at different orientations. 

28. The method of claim 27 wherein the two views are orthogonal. 

29. The method of claim 11 wherein the acquired 2-D images comprise front, forehead, 
chin, angled-right, angled-right-tilted-up, angled-right-tilted-down, angled-left, 
angled-left-tilted-up, angled-left-tilted-down, full-right-profile, 
full-right-profile-tilted-up, full-right-profile-tilted-down, full-left-profile, 
full-left-profile-tilted-up, and full-left-profile-tilted-down views of the face with 
markers in the neutral state. 

30. The method of claim 11 wherein the acquired 2-D images comprise front, forehead, 
chin, full-right-profile, and full-left-profile views of the face with markers in the 
neutral state. 

31. The method of claim 1 wherein the acquired 2-D images include a plurality of views 
of the face with markers in at least one action state. 

32. The method of claim 31 wherein the action states of the face comprise smiling lips, 
kissing lips, yawning lips, raised eyebrows, and squeezed eyebrows. 

33. The method of claim 31 wherein the acquired 2-D images of the face in an action 
state include at least two views at different orientations. 

34. The method of claim 33 wherein the two views are front and angled-right. 

35. The method of claim 11 wherein the step of selecting comprises fixing markers to the 
face and the step of calculating comprises calculating the 3-D locations of the markers 
placed on the face. 

36. The method of claim 35 wherein the step of calculating the 3-D locations of the 
markers comprises the steps of: 

calculating the 3-D locations of the global and local markers in the neutral state; and 
calculating the 3-D locations of the local markers in each action state. 

37. The method of claim 36 wherein the step of calculating the 3-D locations of the 
global and local markers in the neutral state comprises the steps of: 

calculating the 3-D locations of the markers to conform to their 2-D locations in the 
2-D images of the face in the neutral state under an orthographic projection model; 

calculating relative distances of the face to the camera in the 2-D images to conform 
to the 2-D locations of the markers and their calculated 3-D locations under a 
perspective projection model; 

modifying the 2-D locations of the markers to conform to the calculated relative 
distances and the 3-D locations under a perspective projection model; 

recalculating the 3-D locations of the markers to conform to their modified 2-D 
locations under an orthographic projection model; 

repeating the steps of calculating the relative distances, modifying the 2-D locations, 
and recalculating the 3-D locations to satisfy a convergence requirement; and 

translating and rotating the 3-D locations so that they correspond to a frontal-looking 
face. 

38. The method of claim 36 wherein the step of calculating the 3-D locations of the local 
markers in each action state comprises the steps of: 

estimating the orientation and position of the face in each 2-D image of the action 
state to conform to the 3-D and 2-D locations of the global markers under a 
perspective projection model; and 

calculating the 3-D locations of the local markers to conform to the estimated 
orientation and position of the face and the 2-D locations of the local markers under a 
perspective projection model. 

39. The method of claim 13 wherein the step of tracking the 3-D global motion comprises 
the steps of: 

predicting the location of global salient features in a 2-D image; 
detecting global salient features in the 2-D image; and 
estimating the 3-D global motion of the face in the 2-D image. 

40. The method of claim 39 wherein the step of predicting comprises calculating 2-D 
locations of the global salient features under a perspective projection model using the 
position and orientation of the face in a previous 2-D image, and the step of detecting 
comprises detecting the global markers. 
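
By way of non-limiting illustration only, the predicting step of claim 40 may be sketched as a projection of the known 3-D marker locations using the pose found for the previous image. The sketch assumes a simple pinhole camera described by a single focal-length parameter, and a face pose given as a 3x3 rotation matrix and a translation vector. The function name predict_marker_locations and all parameter names are hypothetical and are not taken from the specification. 

```python
def predict_marker_locations(markers_3d, rotation, translation, focal_length):
    """Predict 2-D marker locations under a perspective (pinhole) projection.

    markers_3d   : list of (x, y, z) marker locations in face coordinates
    rotation     : 3x3 rotation matrix (list of rows) from the previous image
    translation  : (tx, ty, tz) face position from the previous image
    focal_length : assumed camera calibration parameter, in pixels
    """
    predicted = []
    for x, y, z in markers_3d:
        # Rotate from face coordinates, then translate into camera coordinates.
        xc = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z + translation[0]
        yc = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z + translation[1]
        zc = rotation[2][0] * x + rotation[2][1] * y + rotation[2][2] * z + translation[2]
        if zc <= 0:
            raise ValueError("marker lies on or behind the camera plane")
        # Perspective division maps the camera-space point to image coordinates.
        predicted.append((focal_length * xc / zc, focal_length * yc / zc))
    return predicted


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(predict_marker_locations([(1.0, 2.0, 0.0)], identity, (0.0, 0.0, 10.0), 100.0))
```

The predicted locations would then serve as the centers of the neighborhoods in which the markers are searched for in the current image. 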

41. The method of claim 40 wherein detecting the global markers comprises: 
determining visibility indices of global markers; 
designing correlation filters for the global markers; 
detecting the global markers by applying elliptical correlation filters in a 
neighborhood of the global markers; and 
eliminating superfluous and multiple detected locations. 
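
By way of non-limiting illustration only, the eliminating step of claim 41 may be sketched as follows. The claim does not state how superfluous and multiple detections are removed; the sketch assumes that each candidate carries a correlation score, that candidates scoring below a threshold are superfluous, and that of several candidates lying close together only the strongest is retained. The function name prune_detections and the parameters min_score and min_separation are hypothetical. 

```python
import math


def prune_detections(candidates, min_score, min_separation):
    """Keep one strong detection per marker neighborhood.

    candidates     : list of (x, y, score) correlation-filter responses
    min_score      : assumed threshold below which a response is superfluous
    min_separation : assumed minimum pixel distance between distinct markers
    """
    kept = []
    # Visit candidates from the strongest response to the weakest.
    for x, y, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score < min_score:
            continue  # weak response: treated as superfluous
        # Drop the candidate if a stronger one was already kept nearby.
        if all(math.hypot(x - kx, y - ky) >= min_separation for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


if __name__ == "__main__":
    responses = [(10, 10, 0.9), (11, 10, 0.8), (50, 50, 0.7), (30, 30, 0.2)]
    print(prune_detections(responses, min_score=0.5, min_separation=5.0))
```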

42. The method of claim 39 wherein the step of estimating comprises calculating the 
position and shape of the face to conform to the 3-D locations and the detected 
locations of the global markers under a perspective projection model. 

43. The method of claim 13 wherein the step of tracking the 3-D local motion comprises 
the steps of: 

predicting the location of local salient features; 
detecting local salient features; and 
estimating the 3-D local motion of the face. 

44. The method of claim 43 wherein the local markers are placed on eyebrows and lips. 

45. The method of claim 44 wherein the locations of the local markers comprise 
proximate ends of the eyebrows, corners of the lips, and the upper and lower centers 
of each lip. 

46. The method of claim 43 wherein the step of predicting the locations of local markers 
comprises calculating the locations of the local markers using the position, 
orientation, and action values of the face in a previous 2-D image and the step of 
detecting comprises detecting the global markers. 

47. The method of claim 44 wherein detecting the local markers comprises: 
determining visibility indices of local markers; 
designing correlation filters for the local markers; 
detecting the local markers by applying elliptical correlation filters in a neighborhood 
of the local markers; and 
eliminating superfluous and multiple detected locations. 

48. The method of claim 43 wherein the step of estimating comprises: 

finding 3-D locations of local markers to conform to the detected 2-D locations of the 
local markers; and 

calculating an action vector representing the weights of facial actions in the 2-D 
image conforming to the found 3-D locations of local markers and the 3-D locations 
of the local markers for the neutral and the action states under a perspective projection 
model. 

49. The method of claim 48 wherein the step of calculating an action vector comprises the 
steps of: 

calculating the difference between the 2-D locations of the local markers detected in 
an image and the 2-D locations of the same markers corresponding to the neutral face; 

modifying the difference to conform to the orthographic projection; 

calculating the 3-D displacements of the local markers with respect to their location in 
the neutral face; and 

calculating the amount of facial actions conforming to the 3-D displacements of the 
local markers. 
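
By way of non-limiting illustration only, the final step of claim 49 may be sketched as a least-squares fit. The sketch assumes that the observed 3-D displacements of the local markers are approximately a weighted sum of the displacements produced by each full facial action, so that the weights form the action vector; the claim itself does not prescribe this solver. The function name solve_action_vector and its arguments are hypothetical. 

```python
def solve_action_vector(displacement, action_displacements):
    """Least-squares weights a[k] with sum_k a[k] * action_displacements[k] ~ displacement.

    displacement         : flat list of stacked 3-D marker displacements from neutral
    action_displacements : one flat list per facial action (full action minus neutral)
    """
    k = len(action_displacements)
    # Build the normal equations (B^T B) a = B^T d, one column of B per action.
    gram = [[sum(p * q for p, q in zip(action_displacements[i], action_displacements[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * q for p, q in zip(action_displacements[i], displacement))
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [row + [b] for row, b in zip(gram, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("action displacements are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


if __name__ == "__main__":
    smile = [1.0, 0.0, 0.0, -1.0, 0.0, 0.0]   # two lip-corner markers move apart
    raised = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]   # both markers move upward
    observed = [0.5, 0.25, 0.0, -0.5, 0.25, 0.0]
    print(solve_action_vector(observed, [smile, raised]))
```

A practical implementation might additionally restrict the weights to a valid range, which this sketch does not attempt. 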

50. The method of claim 48 wherein the step of calculating an action vector comprises the 
steps of: 

calculating the 2-D locations of the local markers corresponding to the neutral face 
using the global motion found for the current image; 

calculating the 2-D locations of the local markers corresponding to the action faces 
using the global motion found for the current image; 

calculating the distance between the detected locations, the distance between the 
neutral locations, and the distance between the action locations of the markers at the 
right and left corners of the lips; 

calculating the distance between the detected locations, the distance between the 
neutral locations, and the distance between the action locations of the markers at the 
upper and lower centers of the lips; 

calculating the distance between the detected locations, the distance between the 
neutral locations, and the distance between the action locations of the markers at the 
proximate ends of the eyebrows; 

determining the fractional displacements of the local markers for the lips area and for 
the eyebrows area; and 

determining action mode and amount for the lips area and for the eyebrows area based 
on the fractional displacements of the local markers. 
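
By way of non-limiting illustration only, the distance comparison and the fractional displacement of claim 50 may be sketched for one pair of markers and one candidate action. The claim does not give a formula; the ratio used below (detected distance minus neutral distance, divided by action distance minus neutral distance) is an assumption, as are the function names marker_distance and fractional_displacement. 

```python
import math


def marker_distance(pair):
    """Euclidean distance between the two 2-D marker locations in a pair."""
    (x1, y1), (x2, y2) = pair
    return math.hypot(x2 - x1, y2 - y1)


def fractional_displacement(detected_pair, neutral_pair, action_pair):
    """Assumed measure: 0.0 at the neutral state, 1.0 at the full action state.

    Each argument is a pair of 2-D marker locations, e.g. the right and left
    corners of the lips as detected, as projected for the neutral face, and
    as projected for the face in the candidate action state.
    """
    d_detected = marker_distance(detected_pair)
    d_neutral = marker_distance(neutral_pair)
    d_action = marker_distance(action_pair)
    span = d_action - d_neutral
    if abs(span) < 1e-9:
        raise ValueError("action state does not change this marker distance")
    return (d_detected - d_neutral) / span


if __name__ == "__main__":
    detected = ((-25.0, 0.0), (25.0, 0.0))   # lip corners in the current image
    neutral = ((-20.0, 0.0), (20.0, 0.0))    # lip corners for the neutral face
    smiling = ((-30.0, 0.0), (30.0, 0.0))    # lip corners for a full smile
    print(fractional_displacement(detected, neutral, smiling))
```

Under this assumption, the action mode for an area could be chosen as the action whose fractional displacements are most consistent across the marker pairs of that area, with the amount taken from those fractions. 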

ABSTRACT 

A method for tracking the motion of a person's face for the purpose of animating 
a 3-D face model of the same or another person is disclosed. The 3-D face model carries 
both the geometry (shape) and the texture (color) characteristics of the person's face. The 
shape of the face model is represented via a 3-D triangular mesh (geometry mesh), while 
the texture of the face model is represented via a 2-D composite image (texture image). 
Both the global motion and the local motion of the person's face are tracked. Global 
motion of the face involves the rotation and the translation of the face in 3-D. Local 
motion of the face involves the 3-D motion of the lips, eyebrows, etc., caused by speech 
and facial expressions. The 2-D positions of salient features of the person's face and/or 
markers placed on the person's face are automatically tracked in a time-sequence of 2-D 
images of the face. Global and local motion of the face are separately calculated using the 
tracked 2-D positions of the salient features or markers. Global motion is represented in a 
2-D image by rotation and position vectors while local motion is represented by an action 
vector that specifies the amount of facial actions such as smiling-mouth, raised-eyebrows, 
etc. 

[FIG. 2 (flowchart fragment): DETERMINE SURFACE NORMALS FOR LOCAL SALIENT FEATURES] 